This IPython notebook was created as a companion to the "Advanced Python for Data Analysis" tutorial and class taught at Investigative Reporters and Editors' Computer-Assisted Reporting conference in March 2015 in Atlanta, Ga.
Now that we're comfortable with our coding environment, and we've dabbled with the IPython notebook, it's time to dive into pandas, the Python library created explicitly for data analysis.
First, in our interactive IPython notebook, we'll import pandas and call it "pd", a convention among pandas users that you'll quickly grow accustomed to.
In [1]:
import pandas as pd
Today, we're going to work with a fairly large text file of car accident data from the New Jersey Department of Transportation. This file contains reports from accidents in the Garden State between 2008 and 2013. The data, originally filed in handwritten reports by state troopers and then typed into a fixed-width database by clerks, is both decently large and very messy. (As an aside, if you're really interested in the genesis of this particular file, you can browse the makefile we used to create it.)
Our analysis today will not be comprehensive or particularly accurate, but working with 1.7 million rows of dirty data is a pretty good way to illustrate what you can do with pandas and how it's going to make your life easier.
The first thing we have to do is ingest our data from a source into a pandas DataFrame object. Think of a DataFrame (similar to R's data frame) as a virtual spreadsheet. It has columns with unique names and rows with unique row numbers, which we call "the Index". You can read in data from many kinds of sources: JSON, Excel files, HTML on the web, SQL databases and others. Today, we're going to work with a CSV. First, we'll create a variable with the name of our CSV file, then we'll use pandas' .read_csv() function.
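For reference, here's roughly what those other readers look like. The file names and URL below are hypothetical stand-ins, not files from this tutorial:
#hypothetical sources, just to show the parallel reader functions
#df = pd.read_excel('accidents.xlsx')            # an Excel workbook
#df = pd.read_json('accidents.json')             # a JSON file
#tables = pd.read_html('http://example.com/t')   # tables scraped from a web page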
In [2]:
datafile = "njaccidents.csv"
df = pd.read_csv(datafile)
Now we have a DataFrame object with the name df. Let's use the .head() method to look at the first five rows of data. Does it look familiar?
In [3]:
df.head()
Out[3]:
It's a table!
To see what we have to work with here, let's list the columns using the DataFrame's .columns attribute.
In [4]:
#.columns returns an Index object that acts like a list, so you can pluck items out of it.
df.columns
Out[4]:
To get a quick overview of what kind of data is in each column, we can try the .describe() method.
In [5]:
df.describe()
Out[5]:
In earlier versions of pandas (pre-0.15), .describe() gives you the summary statistics for the numeric columns only. In newer versions, passing the include='all' keyword argument, as in .describe(include='all'), will summarize all the columns.
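If you're on pandas 0.15 or newer, you can try that for yourself:
df.describe(include='all')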
For categorical columns, like 'Severity', you'll get the number of unique values and the most frequent one.
In [6]:
df['Severity'].describe()
Out[6]:
Try it again for 'County Name'.
In [7]:
df['County Name'].describe()
Out[7]:
Say you want to select a single column. You can do this in one of two ways. If the column name is a string without spaces, you can use dot notation, like df.Severity, for instance. Otherwise, you use a syntax similar to what we're used to with dicts, using brackets, like this: df['County Name']. If you want to grab more than one column, give it a list of column names, as in the example below.
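For instance, passing a list of two column names returns a two-column DataFrame:
#select two columns at once by passing a list of names
df[['County Name', 'Severity']].head()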
Now, let's take a look at cleaning messy data in columns. Why does this return an empty DataFrame?
In [8]:
df[df['County Name']=='Passaic']
Out[8]:
With some digging, by using something like df[df['County Name'].str.contains('PASSAIC')], you can get a DataFrame of rows that have PASSAIC in the 'County Name' column.
In [9]:
df[df['County Name'].str.contains('PASSAIC')].head()
Out[9]:
When we select a single row with the .ix[] indexer, you'll notice that the 'County Name' column includes a bunch of trailing whitespace: 'PASSAIC '. Normally, in Python, you might solve this by writing a for loop that cycles through every item in the column and cleans it up one at a time, like the sketch below.
In [10]:
df['County Name'][df['County Name'].str.contains('PASSAIC')].ix[242727]
Out[10]:
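That loop-based cleanup might look something like this sketch (assuming every value in the column is a string):
#the slow way: strip each value one at a time, then assign the list back
cleaned = []
for name in df['County Name']:
    cleaned.append(name.strip())
#df['County Name'] = cleaned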
But with pandas, we can do the same thing much faster. We'll use the .map() function to run the .strip() method on every string in the column at the same time.
In [11]:
df['County Name']=df['County Name'].map(str.strip)
You can try something similar yourself by finding the Police Department values that contain the string BLOOMINGDALE and cleaning them up with the same strip pattern, if you want to (a sketch follows the output below).
In [12]:
df['Police Department'][df['Police Department'].str.contains('BLOOMINGDALE')]
Out[12]:
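If you do want to clean those up, the pattern is the same one we used for 'County Name' (again assuming every value in the column is a string):
df['Police Department'] = df['Police Department'].map(str.strip)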
Once our 'County Name' field is cleaned, we can filter the table by its values with a == comparison, returning a view of the DataFrame that only shows rows with accidents that happened in Passaic County.
In [13]:
df[df['County Name']=='PASSAIC']
Out[13]:
If we're confident the 'County Name'
column is as clean as it's going to be, we can turn to others. What do we see when we look at the unique values in the 'Police Dept Code'
column with the .unique()
method?
In [14]:
df['Police Dept Code'].unique()
Out[14]:
If we filter the column for the police departments that use '99' as their code, we'll see their names are not unique (of course).
In [15]:
df[['Police Dept Code', 'Police Department']][df['Police Dept Code']==99]
Out[15]:
You also might notice that some of our department codes start with 0, so we have data that is mixed between strings and integers in the same column. We can use pandas' .astype() method to change the type of all of the values in that column to strings.
In [16]:
df['Police Dept Code']=df['Police Dept Code'].astype(str)
df['Police Dept Code'].unique()
Out[16]:
You should note two things here. To actually change the values of the column in the DataFrame in place, we have to assign the result back to the column, as if we're defining a new variable. Also, if you check the unique values again, you'll see that there are now values for both '01' and '1'. You may want to standardize those; one approach is sketched below, but we'll leave the rest to you as an exercise later.
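Here's a minimal sketch of that standardization, assuming '1' and '01' are the only pair that collides:
#map the stray '1' onto '01' so the codes line up
df['Police Dept Code'] = df['Police Dept Code'].replace('1', '01')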
Even after all that, we still have some empty fields containing only spaces, so let's replace those empty values with the word "Unknown". We'll use .loc to assign to that slice of the column in place.
In [17]:
df.loc[df['Police Dept Code']=='  ', 'Police Dept Code'] = 'Unknown'
We've done a bit of cleaning here and may want to start doing some exploratory analysis. To do that, we'll create a smaller DataFrame of just the columns that we plan to examine, and we'll name it myframe.
In [18]:
#Create a smaller frame
myframe = df[['County Name', 'Municipality Name', 'Crash Date', 'Crash Day Of Week', 'Crash Time', 'Total Killed', 'Total Injured', 'Pedestrians Killed', 'Pedestrians Injured', 'Total Vehicles Involved', 'Crash Type Code', 'Alcohol Involved', 'Environmental Condition', 'Light Condition', 'Cell Phone In Use Flag']]
We'll pass myframe along to the next exercise for aggregating, but we might want to save it for later. While we're thinking about it, let's write it out to a CSV with the shockingly named .to_csv() method.
In [19]:
myframe.to_csv('smallertab.csv')
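Note that by default, .to_csv() also writes the DataFrame's index out as an unnamed first column. We'll keep the default here, but if you don't want that column, the method accepts an index keyword:
#skip writing the row index into the file
#myframe.to_csv('smallertab.csv', index=False)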
We've done a bit of cleaning to our outrageously messy data, and now we want to start to explore what's actually in this data and look for more spots we have to clean.
If you were reporting on car accidents in your state, what would you want to know? Some of the questions you'd ask might include: Which county had the most deaths? How do crashes rise and fall over time? How many people are involved in a typical accident?
We'll start by reading our data into our DataFrame. Then we'll filter it and do a few group-bys, which should be familiar if you've ever worked with SQL. Finally, we'll create smaller DataFrames of this aggregated data to visualize.
One of the cleaning tasks we'll have to deal with is the format of the 'Crash Date' field, so we might as well import Python's standard datetime module and read our CSV now.
In [20]:
from datetime import datetime
newframe = pd.read_csv('smallertab.csv')
This should all be pretty familiar. Notice that weird warning about mixed types? Although we cleaned the types in some of the columns, others are still mixed. You can solve this with the dtype keyword argument in the read_csv() function, as sketched below. It's worth looking at the docs to see all the options you can set when you read a CSV.
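A minimal sketch of that fix; the column named here is just an example, so check the warning message for the actual offender:
#force a mixed column to be read as strings from the start
newframe = pd.read_csv('smallertab.csv', dtype={'Crash Type Code': str})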
Now, we want to see our data aggregated by counties, so here's where pandas' .groupby() function comes in.
In [21]:
newframe.groupby('County Name')
Out[21]:
Now, when we run .groupby(), it returns a DataFrameGroupBy object, which is really only useful to us if we do something else with it.
So try this:
In [22]:
#What county had the most accident deaths?
newframe.groupby('County Name').sum()
Out[22]:
Let's deconstruct this a bit, because we're chaining a lot together at once here. We start with our DataFrame object, called newframe. We then call .groupby(), passing it the name of the column we're aggregating on. As we said, this is reminiscent of SQL, and it returns the DataFrameGroupBy object. We can then run .sum() on the GroupBy object to add up the numeric columns for each county, returning a new DataFrame that looks like the kind of pivot table we'd make in Excel.
Say we want to know which county had the most people killed in accidents over this time period. We can slice out just the 'Total Killed' column and sort it from most to least.
In [23]:
newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)
Out[23]:
We select the 'Total Killed' column here using .iloc[], which takes two indexers: one for the rows and one for the columns. The first is a slice of the rows. Since we want all rows, we use the standard Python slicing notation of an empty colon. The second is the column we want, which is the second column, or column 1 if you're counting from 0, as Python does.
Finally, the .order() function sorts the resulting Series; we pass it the ascending keyword with a value of False (it defaults to True) to get the largest figures at the top of the list.
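If counting column positions feels fragile, you can get the same result by asking for the column by its label instead (note that in newer versions of pandas, .order() has been renamed .sort_values()):
#the same result, selecting the column by name rather than position
newframe.groupby('County Name').sum()['Total Killed'].order(ascending=False)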
We can assign that to a new variable that we'll use later to make a bar chart, and move on for now.
In [24]:
countydeaths = newframe.groupby('County Name').sum().iloc[:,1].order(ascending=False)
Now we want to aggregate the accidents by date. But you may have noticed that the dates here are strings, not actual Python date objects. That means they sort as text: 01/01/2008, 01/01/2009, 01/01/2010 and so on, rather than chronologically. We need to convert the strings to actual Python dates.
In [25]:
#lambda creates an unnamed, one-line function; .apply() runs it on each item in the column
newframe['Crash Date']=newframe['Crash Date'].apply(lambda x: datetime.strptime(x, "%m/%d/%Y").date())
What did we do here? We take the 'Crash Date' column and use the .apply() function to perform some operation on the value in every row of that column. We use Python's lambda, a way to write an unnamed, one-line function. With the date string in each row of the 'Crash Date' column as x, apply and lambda run the .strptime() function from Python's datetime module. The part in double quotes, "%m/%d/%Y", says to take a string that is a two-digit month, a two-digit day and a four-digit year, divided by slashes, and convert it to a datetime object. Finally, .date() takes just the date part of that object and returns it to the row. That's a mouthful.
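As an aside, pandas has a built-in parser that can do the same conversion in a single call, without the lambda; just note that it produces pandas Timestamps rather than plain Python dates:
#the pandas equivalent of the strptime dance above
#newframe['Crash Date'] = pd.to_datetime(newframe['Crash Date'], format="%m/%d/%Y")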
Now that we've done that, we can aggregate the crashes by real, sortable dates and count how many accidents happened on each date. Then we'll assign this to a new variable for the resulting DataFrame that we'll use later to make a line chart.
In [26]:
crashesbydate = newframe.groupby('Crash Date').count().iloc[:,0]
So far we've just been transforming and tweaking data that already exists in our dataframe. We can also calculate new data and add it as a column. Here we can create a column for the total number of people involved in an accident by adding together the 'Total Killed' and 'Total Injured' columns. This syntax works similarly to creating a new key in a Python dict.
In [27]:
newframe['Total Involved']=newframe['Total Killed']+newframe['Total Injured']
With this 'Total Involved' value, we can create a histogram later.
In [28]:
newframe['Crash Day Of Week']=newframe['Crash Day Of Week'].map(str.strip)
Here, we're again using .map() to run Python's built-in str.strip on every value in the column, which is a Series object.
Finally, let's create a DataFrame of all the columns aggregated by county so we can make a scatter plot of the counties by total killed and pedestrians killed.
In [29]:
countyframe = newframe.groupby('County Name').sum()
With these new DataFrames, we have several slices of our data. Let's see what it looks like by visualizing it.
First, we need to launch pylab mode inline to get the plots to render inside our notebook. You can do this at startup by running ipython notebook --pylab=inline, or you can do it like this with the % magic.
In [30]:
%pylab inline
Next, we import matplotlib's pyplot module and call it plt.
In [31]:
import matplotlib.pyplot as plt
If we want to make a bar chart of the number of deaths by county from the countydeaths Series we created above, we can use the .plot() function and pass it the kind='bar' keyword argument.
These plots are basically pandas' wrapper around the matplotlib library. There are many other things you can do with the wrapper and with matplotlib itself, and it's worth giving the documentation a read.
In [32]:
countydeaths.plot(kind='bar')
Out[32]:
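One example of what the wrapper gives you: .plot() hands back a matplotlib Axes object, so you can keep styling the chart with ordinary matplotlib calls. A quick sketch (the title text here is ours, not something from the data):
#style the bar chart with matplotlib Axes methods
ax = countydeaths.plot(kind='bar')
ax.set_title('Accident deaths by county, 2008-2013')
ax.set_ylabel('Total killed')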
If we want to do a line graph, we can use the .plot() function on a DataFrame or Series; the default kind is a line graph. So here's a graph of crashes by date.
In [33]:
crashesbydate.plot()
Out[33]:
And here's the histogram of our new 'Total Involved' column, with 50 bins.
In [34]:
newframe['Total Involved'].hist(bins=50)
Out[34]:
Finally, for now, we can scatter plot our county data by the total number of people killed versus the number of pedestrians killed.
In [35]:
#let's scatter plot county name by total killed and pedestrians killed.
countyframe.plot(kind='scatter', x='Total Killed', y='Pedestrians Killed')
Out[35]:
I hope this was helpful. If you have questions or spot bugs, you can find me on GitHub, via email or on Twitter.